Abstract. Let π_w denote the failure function of the Morris-Pratt algorithm
for a word w. In this paper we study the following problem: given an integer
array A[1 . . n], is there a word w over an arbitrary alphabet Σ such that
A[i] = π_w[i] for all i? Moreover, what is the minimum cardinality of Σ
required? We give a real time linear algorithm for this problem in the
unit-cost RAM model with Θ(log n) bits word size. Our algorithm also returns
a word w over a minimal alphabet such that π_w = A, and uses just o(n) words
of memory. Then we consider the function π' instead of π and give an online
O(n log n) algorithm for this case. This is the first polynomial algorithm
for the online version of this problem.

1 Introduction 

The Morris-Pratt algorithm [13], the first linear-time pattern matching
algorithm, is well known for its simple and beautiful concept. It simulates a
forward-prefix-scan DFA for pattern matching [2] by using a carefully chosen
failure function π, also known as a border array. The algorithm utilizes values
of π for all prefixes of the pattern. It behaves like the automaton in the
sense that it reads each symbol of the text once and simulates the automaton's
transition. The amortized time per transition is constant, and the required
values of the prefix function can be calculated beforehand in linear time in a
similar fashion.

The failure function itself is of interest as, for instance, it captures all
the information about periodicity of the word. Hence it is often used in word
combinatorics and numerous text algorithms, see [2, 4, 5]. The Morris-Pratt
algorithm has many variants. In particular, the Knuth-Morris-Pratt algorithm
[10] works in exactly the same manner, but uses a slightly different failure
function, namely the KMP array π' (or strong failure function). The time bounds
for the KMP algorithm are precisely the same as for the MP algorithm, but KMP
has a smaller upper bound on the time spent processing a single letter: for KMP
this bound is O(log m), whereas for MP it is O(m), where m denotes the length
of the pattern.

We investigate the following problem: given an integer array A[1 . . n], is
there a word w over an arbitrary alphabet Σ such that A[i] = π_w[i] for all i,
where π_w denotes the failure function of the Morris-Pratt algorithm for the
word w? If so, what is the minimum cardinality of the alphabet Σ over which
such a word exists? Pursuing these questions is motivated by the fact that in
word combinatorics one is often interested only in the values of π_w for every
prefix of a word w rather than in w itself. Thus it makes sense to ask if there
is a word w that admits π_w = A for a given array A. Validation of border
arrays is also an important building block of many algorithms that generate all
valid border arrays [8, 9, 12].

We are interested in an online algorithm, i.e. one that receives the input
array values one by one, and is required to output the answer after reading
each such single value. The maximum time spent on processing a single piece of
input is the delay of the algorithm. When the delay is constant, we call the
algorithm real time. This and similar problems were addressed before by other
researchers. Recently a linear online algorithm for (closely related) prefix
array validation has been given [3]. A simple linear online algorithm for π
validation is known [6], though it has a min(n, |Σ|) delay. The authors of [6]
were unaware that in this case |Σ| = O(log n) [12], hence the delay of this
algorithm is in fact logarithmic.

We provide an online real time algorithm working in the unit-cost RAM model
(i.e. we assume that words consisting of Θ(log n) bits can be operated on in
constant time) and using O(n log log n) bits. We show that Ω(n) bits of space
are necessary if the input is read-once only.

Then we turn our attention to π'. There is an offline linear bijective
transformation between π and π'. This transformation can be performed with
access to the arrays only, i.e. with no access to the word itself. Thus it is
possible to check offline whether there exists w such that A = π'_w in linear
time. The task becomes much harder when an online algorithm is required. Our
online algorithm, which is the first polynomial algorithm for the problem, has
running time O(n log n).

This problem was investigated for a slightly different variant of π' and an
offline validation algorithm for this variant is known [6]. The function g
considered therein can be expressed as g[n] = π'[n − 1] + 1. The aforementioned
bijection between π and π' cannot be applied to g as it essentially uses the
unavailable value π[n] = π'[n]. While instances on which the algorithm runs in
time Ω(n^2) are known, no polynomial upper bound on the algorithm's running
time was provided in [6]. The algorithm for online π' validation we provide can
be applied to g validation with no changes.

2 Preliminaries 

For w ∈ Σ*, we denote its length by |w| or simply n. For v, w ∈ Σ*, by vw we
denote the concatenation of v and w. We say that u is a prefix of w if there is
v ∈ Σ* such that w = uv. Similarly, we call v a suffix of w if there is u ∈ Σ*
such that w = uv. A word v that is both a prefix and a suffix of w is called a
border of w. By w[i] we denote the i-th letter of w and by w[i . . j] we denote
the subword w[i]w[i + 1] . . . w[j] of w. We call a prefix (respectively:
suffix, border) v of the word w proper if v ≠ w, i.e. it is shorter than w
itself.

For a word w its failure function π_w is defined as follows: π_w[i] is the
length of the longest proper border of w[1 . . i] for i = 1, 2, . . . , n. By
π_w^(k) we denote π_w composed k times with itself, namely π_w^(0)[i] := i and
π_w^(k+1)[i] := π_w[π_w^(k)[i]]. This convention applies to other functions as
well. We omit the subscript w in π_w whenever it is unambiguous.



We state some simple properties of borders and the prefix function. If u, v
and w are words such that |u| < |v| < |w| and v is a border of w, then u is a
border of v if and only if u is a border of w. As a consequence, every border
of w[1 . . i] has length π_w^(k)[i] for some integer k ≥ 0.

The strong failure function π' is defined as follows: π'_w[n] := π_w[n], and
for i < n, π'_w[i] is the length of the longest (proper) border of w[1 . . i]
such that w[π'_w[i] + 1] ≠ w[i + 1]. If no such border exists, π'[i] = −1.

Compute-π(w)
  π[1] ← 0, k ← 0
  for i ← 2 to n do
    while k > 0 and w[k + 1] ≠ w[i] do
      k ← π[k]
    if w[k + 1] = w[i] then k ← k + 1
    π[i] ← k

Fig. 1: Function π for word aabaabaaabaac. The dashed lines represent the
consecutive tries of Compute-π when computing π_w[i].

It is well-known that π_w and π'_w can be obtained from one another in linear
time, using additional lookups in w to check conditions of the form
w[i] = w[j]. What is perhaps less known, these lookups are not necessary, i.e.
there is a bijection between π_w and π'_w. Values of this function, as well as
its inverse, can be computed in linear time. The correctness of the two
procedures below follows from two (equivalent) observations:

\[ w[i+1] \neq w[\pi[i]+1] \iff \pi[i+1] < \pi[i]+1 \iff \pi'[i] = \pi[i], \tag{1} \]
\[ w[i+1] = w[\pi[i]+1] \iff \pi[i+1] = \pi[i]+1 \iff \pi'[i] = \pi'[\pi[i]]. \tag{2} \]

Compute-π'-From-π(π)
  π'[0] ← −1, π'[n] ← π[n]
  for i ← 1 to n − 1 do
    if π[i + 1] = π[i] + 1 then
      π'[i] ← π'[π[i]]
    else π'[i] ← π[i]

Compute-π-From-π'(π')
  π[n] ← π'[n]
  for i ← n − 1 downto 1 do
    π[i] ← max{π'[i], π[i + 1] − 1}
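The three procedures can be tried out directly. The following Python sketch is an illustration only: the function names are chosen here, arrays are 1-indexed with index 0 used as a placeholder (and as the sentinel π'[0] = −1), and the conversions are read as "π'[i] = π'[π[i]] when π[i + 1] = π[i] + 1, else π[i]" and "π[i] = max(π'[i], π[i + 1] − 1)".

```python
def compute_pi(w):
    """Failure function, 1-indexed: pi[i] is the longest proper border of w[1..i]."""
    n = len(w)
    pi = [0] * (n + 1)                      # pi[0] is an unused placeholder
    k = 0
    for i in range(2, n + 1):
        # w[k] (0-indexed) is the letter w[k + 1] of the 1-indexed pseudocode
        while k > 0 and w[k] != w[i - 1]:
            k = pi[k]
        if w[k] == w[i - 1]:
            k += 1
        pi[i] = k
    return pi


def strong_from_pi(pi):
    """Strong failure function computed from pi alone (no access to the word)."""
    n = len(pi) - 1
    sp = [-1] * (n + 1)                     # sp[0] = -1 acts as the sentinel pi'[0]
    for i in range(1, n):
        sp[i] = sp[pi[i]] if pi[i + 1] == pi[i] + 1 else pi[i]
    if n >= 1:
        sp[n] = pi[n]                       # by definition pi'[n] = pi[n]
    return sp


def pi_from_strong(sp):
    """Inverse conversion, scanning from right to left."""
    n = len(sp) - 1
    pi = [0] * (n + 1)
    if n >= 1:
        pi[n] = sp[n]
    for i in range(n - 1, 0, -1):
        pi[i] = max(sp[i], pi[i + 1] - 1)
    return pi


pi = compute_pi("aabaabaaabaac")            # the word of Fig. 1
strong = strong_from_pi(pi)
assert pi_from_strong(strong) == pi         # the two conversions are mutually inverse here
```

Neither conversion looks at the word, which is the point of the bijection: only the arrays are consulted.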



3 Online border array validation 

Let T be a graph with vertices {1, 2, . . . , n} and directed edges
{(k, π[k − 1] + 1) : k = 2, . . . , n}, see Fig. 2 for an example. Observe that
T is a directed tree: each vertex except the vertex 1 has exactly one outgoing
edge, and since π[i] < i, the graph is acyclic and the vertex 1 is reachable
from every other vertex. Thus vertex 1 is the root of T and all the edges are
directed towards it. Therefore we use the standard notation f[i] to denote the
unique out-neighbour of i for i > 1. We also call i' an ancestor of i > 1 if
i' = f^(k)[i] for some k > 0 (note that i is not its own ancestor).

Fig. 2: An example of tree T for a word from Fig. 1.

Define an analogous structure T' with π' instead of π: f'[i] = π'_w[i − 1] + 1.
If f'[i] = 0 for some i then i has no outgoing edge in T'. Thus T' is a forest
and not necessarily a tree. By (1)-(2), f'[i] can be expressed using π and f:

\[
f'[i] =
\begin{cases}
f[i] & \text{if } \pi[i] \neq f[i],\\
f'[f[i]] & \text{if } \pi[i] = f[i].
\end{cases}
\tag{3}
\]

Our approach to online border array validation is as follows: assume there is
a word w admitting π_w = A and implicitly construct T for π_w = A. Then recover
π'_w from π_w and construct T'. Using both T and T', invalidity of A can be
detected as soon as it occurs.
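The two parent functions can be built from A alone. The sketch below is illustrative (names such as `tree_parents` are chosen here) and reads relation (3) as: f'[i] = f[i] when A[i] ≠ f[i], and f'[i] = f'[f[i]] otherwise. A parent value of 0 stands for "no outgoing edge".

```python
def border_array(w):
    """1-indexed failure function of w; index 0 is an unused placeholder."""
    n = len(w)
    pi = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and w[k] != w[i - 1]:
            k = pi[k]
        if w[k] == w[i - 1]:
            k += 1
        pi[i] = k
    return pi


def tree_parents(A):
    """Parents f (tree T) and fp (forest T') for a valid 1-indexed border array A."""
    n = len(A) - 1
    f = [0] * (n + 1)                       # f[1] = 0: vertex 1 is the root of T
    fp = [0] * (n + 1)                      # fp[i] = 0: i is a root of T'
    for i in range(2, n + 1):
        f[i] = A[i - 1] + 1
        # Relation (3): keep the edge of T when A[i] differs from f[i],
        # otherwise inherit the T'-parent of f[i].
        fp[i] = f[i] if A[i] != f[i] else fp[f[i]]
    return f, fp


A = border_array("aabaabaaabaac")           # the word of Fig. 1
f, fp = tree_parents(A)
```

Because f[i] < i, the value fp[f[i]] is always available when position i is processed, so the construction is online.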

We use the terms father, ancestor, etc. when referring to A, as it uniquely
defines the graphs T and T'. Edges in T reflect the comparisons done by
Compute-π(w) for each position in w, or equivalently, each vertex of T. In what
follows, we formalise the connection between ancestors in T and (in)equalities
between certain symbols of w that hold under the assumption that A = π_w.

Lemma 1. Suppose that A = π_w for some word w. Then for i = 2, . . . , n one
of the following conditions holds:

1. A[i] = 0 and w[i] ≠ w[i'] for all ancestors i' of i,

2. A[i] ≠ 0 and there exists an ancestor i' = π[i] of i such that w[i] = w[i']
and w[i] ≠ w[i''] for all interior nodes i'' on the directed path from i to i'.

Proof. It follows from Compute-π: when calculating π_w, Compute-π repeatedly
checks whether w[π^(k)[i − 1] + 1] = w[i] for successive values of k, starting
with k = 1. Precisely, it checks whether w[i'] = w[i] for successive ancestors
i' of i. This test ends as soon as the smallest k is found such that
w[π^(k)[i − 1] + 1] = w[i] or π^(k)[i − 1] = 0. In the latter case
w[i] ≠ w[i'] for all ancestors i' of i and A[i] = 0, whereas in the former case
A[i] ≠ 0 and there exists an ancestor i' = π[i] of i such that w[i] = w[i'] and
w[i] ≠ w[i''] for all i'' ≠ i, i' on the path from i to i'. □

This follows from the way Compute-π works. Refer to Fig. 2 for an example.
Using T' a slightly stronger statement can be formulated:

Lemma 2. Suppose that A = π_w for some word w. Then either A[i] = f[i] or
A[i] = 0 or A[i] = f'^(k)[i] for some k ≥ 1.

Proof. When calculating the candidates for π[i] we look at the sequence of
values f[i], f^(2)[i], . . . but whenever w[f^(k)[i]] = w[f^(k+1)[i]], we can
safely skip f^(k+1)[i]. The largest i' such that i' = π^(k)[i − 1] + 1 for some
k and w[i'] ≠ w[i] is i' = π'[i − 1] + 1 = f'[i]. □

The following criteria follow from Fact 1: if A = π_w for some w then

\[ A[1] = 0 \ \text{ and } \ \forall_{i>1}\ \ A[i] > 0 \implies \exists_{k}\ A[i] = f^{(k)}[i], \tag{4} \]
\[ \forall_{i>1}\ \forall_{k>0}\ \ 0 < A[i] < f^{(k)}[i] \implies A[i] \neq A[f^{(k)}[i]]. \tag{5} \]

Conditions (4)-(5) are necessary and sufficient for A to be a valid π array [6].
They yield an algorithm for testing whether A is a valid π table and calculating
the minimal size of the required alphabet [8].

Instead of checking whether there is j on the path from i to the root such
that j = A[i] and j is a valid candidate for π[i], one can store all the valid
candidates for i at the node i and do the required checks locally. It turns out
that the sets of valid candidates satisfy a simple recursive formula [7]:

\[
cand[i] =
\begin{cases}
\{0, f[i]\} \cup \bigl(cand[f[i]] \setminus \{A[f[i]]\}\bigr) & \text{if } i > 1,\\
\{0\} & \text{if } i = 1.
\end{cases}
\tag{6}
\]

Moreover, in (6), cand[f'[i]] \ {A[f'[i]]} can be used instead of
cand[f[i]] \ {A[f[i]]}, leading to a more sophisticated algorithm Validate(A)
[7]. It runs in linear time and space, and has O(min(n, |Σ|)) delay [7]. Note
that the running time bound is not obvious, as a set of candidates is kept at
each node. It can be bounded by noting that each valid candidate corresponds to
one non-trivial transition of the automaton recognizing the language Σ*w, and
the number of such transitions is linear, regardless of the size of Σ [14].
Moreover, the minimal size of Σ is O(log n) [12, Th. 3.3a], of which the
authors of [7] were unaware.

Let d' denote the depth of T'. In our algorithm Validate-π-RAM, we exploit
the fact that d' ∈ O(log n). It follows easily from the following lemma:

Lemma 3. If A = π_w, then f'^(2)[i] < i/2 for all i.



Validate(A)
  if A[1] ≠ 0 then
    error A not valid at 1
  cand[1] ← ∅, w[1] ← 1, alph[1] ← 1, max-alph ← 1
  for p ← 2 to n do
    cand[p] ← (cand[f[p]] \ {A[f[p]]}) ∪ {f[p]}
    if A[p] = 0 then
      alph[p] ← alph[f[p]] + 1
      max-alph ← max(max-alph, alph[p])
      w[p] ← alph[p]
    else
      if A[p] ∉ cand[p] then
        error A not valid at p
      w[p] ← w[A[p]], alph[p] ← alph[f[p]]
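A runnable counterpart of the candidate-set validation may help to see the bookkeeping. The Python sketch below is not the authors' code: it assumes that the non-zero candidates of position p are f[p] together with the candidates of f[p] other than A[f[p]], that 0 is always admissible, and that a position with A[p] = 0 receives a fresh letter. The function name and the return convention are chosen here for illustration.

```python
def validate(values):
    """Check whether values = A[1..n] is a border array.

    Returns (bad, word, sigma). If A is valid, bad is None, word is a witness
    (letters are the integers 1, 2, ...) and sigma is its alphabet size.
    Otherwise bad is the first invalid position (1-indexed).
    """
    n = len(values)
    A = [0] + list(values)                  # shift to 1-indexed positions
    if n == 0:
        return None, [], 0
    if A[1] != 0:
        return 1, None, None
    cand = [set() for _ in range(n + 1)]    # non-zero candidates only
    w = [0] * (n + 1)
    alph = [0] * (n + 1)                    # distinct letters on the path from p to the root of T
    w[1] = alph[1] = max_alph = 1
    for p in range(2, n + 1):
        f = A[p - 1] + 1                    # parent of p in T
        cand[p] = (cand[f] - {A[f]}) | {f}
        if A[p] == 0:
            alph[p] = alph[f] + 1           # a letter not used by any ancestor of p
            max_alph = max(max_alph, alph[p])
            w[p] = alph[p]
        elif A[p] in cand[p]:
            w[p] = w[A[p]]                  # reuse the letter of the chosen border end
            alph[p] = alph[f]
        else:
            return p, None, None
    return None, w[1:], max_alph


bad, word, sigma = validate([0, 0, 1, 1])   # border array of the word abaa
```

Copying a set at every position keeps the sketch short; a real implementation would share the sets between a node and its parent to stay within the linear bounds discussed in the text.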



Proof. First observe that if π^(2)[n] ≥ n/2 − 1 then
w[π^(2)[n] + 1] = w[π[n] + 1]. Indeed, both n − π[n] and n − π^(2)[n] are
periods of w[1 . . n]. Since their sum is at most n/2 + 1 + n/2 = n + 1, by the
periodicity lemma gcd(n − π[n], n − π^(2)[n]) is also a period, hence
π[n] − π^(2)[n] is a period as well. Therefore

\[ w[\pi^{(2)}[n] + 1] = w[\pi^{(2)}[n] + 1 + \pi[n] - \pi^{(2)}[n]] = w[\pi[n] + 1]. \]

Now consider a path in T such that i_k = f[i_{k+1}] and f'^(2)[i_ℓ] = i_1. We
aim at showing that f'^(2)[i_ℓ] < i_ℓ/2.






Pawel Gawrychowski, Artur Jez, and Lukasz Jez 



We begin with the observation that w[i_2] ≠ w[i_1], i.e. that π[i_2] ≠ i_1, by
(1). Assume the opposite. In particular f'[i_2] < i_1. Let i = f'[i_ℓ]. Then
f'[i] = i_1. By definition this means that π'[i − 1] = i_1 − 1, thus
w[1 . . i_1 − 1] is a border of w[1 . . i − 1] and w[i] ≠ w[i_1]. But
w[1 . . i_2 − 1] is a border of w[1 . . i − 1] as well and
w[i_2] = w[i_1] ≠ w[i]. Since i > i_2 it holds that π'[i − 1] ≥ i_2 − 1,
contradiction. We conclude that w[i_2] ≠ w[i_1].

Take n = i_3 − 1 and suppose that π^(2)[n] ≥ n/2 − 1. Then by the first
paragraph w[π^(2)[n] + 1] = w[π[n] + 1]. But π^(2)[n] + 1 = f^(2)[i_3] = i_1 and
π[n] + 1 = f[i_3] = i_2. Hence w[i_2] = w[i_1], contradiction. Therefore
π^(2)[n] < n/2 − 1. That is, π^(2)[i_3 − 1] < (i_3 − 1)/2 − 1, hence
i_1 = π^(2)[i_3 − 1] + 1 < i_3/2.

Since both f and f' are monotone functions, f^(2)[i_3] = i_1 implies
f'^(2)[i_3] ≤ i_1. Therefore i_ℓ ≥ i_3, since f'^(2)[i_ℓ] = i_1. We conclude that

\[ f'^{(2)}[i_\ell] = i_1 < \frac{i_3}{2} \le \frac{i_\ell}{2}, \]

which ends the proof. □
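The statement of Lemma 3 is easy to probe experimentally. The sketch below reads the lemma as f'^(2)[i] < i/2 and builds f' from π by following f when π[i] ≠ f[i] and inheriting f'[f[i]] otherwise; both the reading and the helper names are assumptions of this illustration. It also reports the depth of T', which the lemma bounds by roughly 2 log n.

```python
def strong_parents(w):
    """fp[i] = pi'[i - 1] + 1 for i = 2..n, with fp[0] = fp[1] = 0 (no parent)."""
    n = len(w)
    pi = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and w[k] != w[i - 1]:
            k = pi[k]
        if w[k] == w[i - 1]:
            k += 1
        pi[i] = k
    fp = [0] * (n + 1)
    for i in range(2, n + 1):
        f = pi[i - 1] + 1
        fp[i] = f if pi[i] != f else fp[f]
    return fp


def second_ancestor_is_small(w):
    """True if the grandparent of every position i in T' is below i / 2."""
    fp = strong_parents(w)
    return all(2 * fp[fp[i]] < i for i in range(1, len(w) + 1))


def strong_forest_depth(w):
    """Largest number of T' edges between a position and the root of its tree."""
    fp = strong_parents(w)
    depth = [0] * (len(w) + 1)
    for i in range(2, len(w) + 1):
        if fp[i] > 0:                       # parents have smaller indices, so depth[fp[i]] is final
            depth[i] = depth[fp[i]] + 1
    return max(depth)


fib_prev, fib = "a", "ab"
while len(fib) < 200:                       # Fibonacci words are a classical hard case for KMP
    fib_prev, fib = fib, fib + fib_prev
assert second_ancestor_is_small(fib)
```

Halving after every two steps is what makes the depth of T' logarithmic, so a depth-indexed bit vector fits into a constant number of machine words.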



By Lemma 2, either A[i] = f[i] or A[i] is i's ancestor in T'. Using these
observations, d' values of all valid candidates for π[i], except f[i], can be
stored at vertex i instead of the candidates themselves. Vertex i stores them
encoded in a bit vector Bcand[i]. The j-th bit of Bcand[i] is set to 1 if there
is a valid candidate i' for π[i] with d'[i'] = j, and to 0 otherwise. The
depths in T' can be encoded in a bit vector using only a constant number of
machine words. To validate A[i], we check if the d'[A[i]]-th bit of Bcand[i] is
set to 1 and perform two further tests: first, we check if A[i] is an ancestor
of i in T, and then if there is a valid candidate j for π[i] among the
ancestors of i in T such that d'[j] = d'[A[i]]. Clearly, A[i] is valid if and
only if all three tests are successful.

To perform these tests efficiently, we use a data structure [1], working in the
RAM model, that supports the level ancestor query in any tree: LA(i, Δ) returns
the ancestor j of i that is Δ levels above i. The data structure also supports
addition of new leaves and takes only constant time per any of these two
operations. To use this data structure, each node needs two additional fields,
d and d'.

Validate-π-RAM(A)
  if A[1] ≠ 0 then error A not valid at 1
  Bcand[1] ← 0, alph[1] ← 1, d[1] ← d'[1] ← 1
  for i ← 2 to n do
    d[i] ← d[f[i]] + 1, d'[i] ← d'[f[i]]
    if A[i] ≠ f[i] then d'[i] ← d'[i] + 1
    Bcand[i] ← Bcand[f[i]]
    Bcand[i][d'[f[i]]] ← 1
    Bcand[i][d'[A[f[i]]]] ← 0
    if A[i] = 0 then
      w[i] ← alph[i] ← alph[f[i]] + 1
      max-alph ← max(max-alph, alph[i])
    else j ← LA(i, d[i] − d[A[i]] − 1)
      if A[i] ≠ f[j] or d'[j] = d'[A[i]] then
        error A not valid at i
      if Bcand[i][d'[A[i]]] = 0 then
        error A not valid at i
      w[i] ← alph[i] ← alph[f[i]]
we check if A[i] is an ancestor 
Theorem 1. Validate-π-RAM works in linear time with constant delay. It uses a
linear number of machine words.

Proof. First of all note that the calculations of d' are proper, as by (3) and
easy induction it holds that

\[
d'[i] =
\begin{cases}
d'[f[i]] & \text{if } \pi[i] = f[i],\\
d'[f[i]] + 1 & \text{if } \pi[i] \neq f[i].
\end{cases}
\tag{7}
\]

Update of the information kept in Bcand is correct, as it follows directly the
recurrence relation for sets of valid candidates (6).

The memory usage is obvious, as only a constant number of machine words per
node is used. The same applies to the running time: only a constant number of
operations per letter of input is performed, and the level ancestor query takes
only constant time. The additional cost of maintaining the data structure for
those queries is only a constant per position read.

Now we inspect the problem of checking whether A[i] is the unique node among
ancestors j of i such that d'[j] = d'[A[i]] that is a valid candidate for π[i].
We show that the valid candidate is the one among those vertices which has the
largest depth in T. Consider j that is a valid candidate for π[i] and j' such
that f[j'] = j. Suppose that d'[j] = d'[j']. Then by (7) A[j'] = j. But then by
(4) A[i] ≠ j' implies A[i] ≠ A[j'] = j, contradiction. So d'[j'] ≠ d'[j]. Note
that by (7) the ancestors of i of the same depth in T' are consecutive nodes on
the path from i to the root. We conclude that the valid candidate j for π[i] of
fixed d'[j] is the one among the ancestors of i of fixed d' that has the
largest depth in T.

Consider the path from i to the root in T. It consists of blocks of nodes such
that π[j] = f[j]. By (7), positions in each block have the same d', and d'
decreases by 1 at the block's end. Suppose that j = f[j'] is a valid candidate
and that it is not the first vertex in its block. By (5), if A[i] ≠ j', then
A[i] ≠ A[j'] = j, contradiction. So if j is a valid candidate, it must be the
first vertex in its block.

To prove the correctness, it is enough to show that Validate-π-RAM correctly
recognises whether A[i] is one of i's valid candidates for π[i]. First of all,
if A[i] ≠ 0, it is checked whether A[i] is an ancestor of i in T: using a level
ancestor query the ancestor at the depth of A[i] is recovered; if it is
different from A[i], A is invalid at i. To check whether A[i] is the first
vertex in the block of vertices of the same d', recover the previous vertex on
the path from i to the root and check if it has a different d'. If not, A is
rejected. If A[i] is the first in its block, check whether there is a valid
candidate in that block. If so, then clearly A[i] is the one. □

Both Validate-π(A) and the algorithm from [7] use a linear number of machine
words, i.e. Θ(n log n) bits. It can be shown that at least Ω(n) bits are
necessary in the streaming setting, i.e. when successive input values are given
one by one and cannot be re-read.
Theorem 2. Deterministic streaming verification of a π or π' array of length n
requires Ω(n) bits of memory.

Proof. Since 0 and π[i − 1] + 1 are always valid values for π[i], there are at
least 2^(n−1) π arrays of length n. Assume that n/2 − 2 bits are enough. It
means that there are two different valid prefixes of π arrays of length n/2
(π_1 and π_2) after reading which the algorithm is in the same state. Let
i < n/2 be any index at which those two prefixes differ: π_1[i] < π_2[i] < i.
We append 0, 1, 2, 3, . . . , i, π_2[i] + 1, 0, 0, . . . , 0 to both of them
(π_2[i] + 1 is at position n/2 + i + 2). After reading both input sequences the
algorithm is in the same state for both of them. But exactly one of the
resulting arrays is valid: consider π_1[n/2 + i + 2]. Just before reading
π_2[i] + 1 < i + 1, all possible candidates for π_1[n/2 + i + 2] are i + 1,
π_1[i] + 1, π_1^(2)[i] + 1, . . . . And i + 1 > π_2[i] + 1 > π_1[i] + 1, thus
the sequence is invalid. What is left to show is that π_2[i] + 1 is a valid
candidate for π_2 at position n/2 + i + 2. Since (π_2[i] + 1) ≠ i + 1, it
should hold that π_2[i] + 1 ≠ π_2[i + 1], by (4). But π_2[i + 1] = 0 and
π_2[i] + 1 > 0.

So we get a contradiction: the algorithm is in the same state in both cases
and thus cannot be correct. □
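The counting step at the start of the proof can be checked by brute force for small n. The sketch below is only an illustration with helper names chosen here; it relies on the assumption that an alphabet of n letters is enough to realise every border array of length n, and it enumerates all words over such an alphabet.

```python
from itertools import product


def border_array(w):
    """0-indexed failure function: entry i is the longest proper border of w[:i + 1]."""
    pi = [0] * len(w)
    k = 0
    for i in range(1, len(w)):
        while k > 0 and w[k] != w[i]:
            k = pi[k - 1]
        if w[k] == w[i]:
            k += 1
        pi[i] = k
    return tuple(pi)


def valid_arrays(n):
    """Set of all border arrays of length n (exhaustive, so only for small n)."""
    return {border_array(w) for w in product(range(n), repeat=n)}


def extensions_are_valid(n):
    """Every valid array of length n - 1 can be extended by 0 and by last + 1."""
    longer = valid_arrays(n)
    return all(a + (0,) in longer and a + (a[-1] + 1,) in longer
               for a in valid_arrays(n - 1))


counts = [len(valid_arrays(n)) for n in range(1, 6)]
```

Two admissible continuations per position give the 2^(n−1) lower bound, which is what forces a linear number of bits of state in the streaming setting.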

Although we do not know if O(n) bits are enough, we are able to show that the
total memory usage can be reduced to just O(n log log n) bits, i.e. a sublinear
amount of machine words. The algorithm remains real time, the delay is still
constant and so the total running time is linear.

In order to reduce the memory usage, we cannot store the values of f[i] for
each i, as this may use Ω(n log n) bits (consider the text a^n). Hence we store
f'[i]. It turns out that there cannot be too many different large values of f'
and that they can all be stored (using some clever encoding) in O(n log log n)
bits.

To implement this approach we adapt Validate-π so that it uses f' instead of
f. Let us take a closer look at Validate-π(A). Assuming that it has already
processed the prefix A[1 . . i − 1], we know it is a valid π array. Thus we may
calculate its corresponding π' array, denoted by A'[1 . . i − 1]. By Lemma 2
the values of A[1], A[2], . . . , A[i − 2] are not needed: whenever A[i] ≠ 0 and
A[i] ≠ A[i − 1] + 1, we check whether A[i] is among the ancestors of i in T'.
If so, we verify whether no lower ancestor of i in T' was assigned the same
letter.

In order to present the details of this algorithm we need to understand the
underlying combinatorics of π'. The following lemma shows that different large
values of π' cannot be packed too densely, which allows us to store information
about different positions in a more concise way.

Lemma 4. Let k ≥ 0 and consider a segment of 2^k consecutive indices in the
π' array. At most 48 different values from [2^k, 2^(k+1)) occur in such a
segment.

Proof. First notice that each i such that π'[i] > 0 corresponds to a
non-extensible occurrence of the prefix w[1 . . π'[i]], i.e. π'[i] is maximal
among j such that w[1 . . j] is a suffix of w[1 . . i] but w[j + 1] ≠ w[i + 1].

If k < 2 then the claim is trivial. So let k' = k − 2 ≥ 0 and assume that there
are more than 48 different values from [4 · 2^k', 8 · 2^k') occurring in a
segment of length 2^k. Then more than 12 different values from
[4 · 2^k', 8 · 2^k') occur in a segment of length 2^k'. Split this range into
three subranges [4 · 2^k', 5 · 2^k'), [5 · 2^k', 6 · 2^k') and
[6 · 2^k', 8 · 2^k'). Hence at least 5 different values from one of those
subranges [ℓ, r) occur in this segment, for some ℓ, r. Note that
r ≤ (3/2)ℓ − 2^k'. Let these 5 different values occur at positions
p_1 < . . . < p_5. Consider the sequence p_i − π'[p_i] + 1 for i = 1, . . . , 5:
these are the beginnings of the corresponding non-extensible prefixes. In
particular all these elements are pairwise different. Each sequence of length 5
contains a monotone subsequence of length 3. We consider the cases of
increasing and decreasing sequence separately:

1. there exist p_{i_1} < p_{i_2} < p_{i_3} in this segment such that
p_{i_1} − π'[p_{i_1}] > p_{i_2} − π'[p_{i_2}] > p_{i_3} − π'[p_{i_3}].

Fig. 3: Proof of Lemma 4, increasing sequence.

Both a and b are periods of w[1 . . s] (see Fig. 3). As s ≥ ℓ and
a < b < r − ℓ, the condition r ≤ (3/2)ℓ − 2^k' ensures that a, b < s/2. Thus by
the periodicity lemma b − a is also a period of w[1 . . s]. But then
w[p_{i_1} + 1] = w[p_{i_1} + 1 + b − a], so x = y, making the value of
π'[p_{i_1}] incorrect.

2. there exist p_{i_1} < p_{i_2} < p_{i_3} in this segment such that
p_{i_1} − π'[p_{i_1}] < p_{i_2} − π'[p_{i_2}] < p_{i_3} − π'[p_{i_3}].

Fig. 4: Proof of Lemma 4, decreasing sequence.

Let s_1 = π'[p_{i_1}] and s_2 = π'[p_{i_2}] (see Fig. 4). Then s_j ≥ ℓ by
assumption. Moreover a + b < r − ℓ + 2^k', so the condition r ≤ (3/2)ℓ − 2^k'
ensures that a + b < ℓ/2 ≤ s_i/2 for i = 1, 2. As s_1 ≠ s_2, there are two
subcases:

(a) s_1 < s_2: then b is a period of w[1 . . s_1 + 1] and
w[s_1 + 1] = w[s_1 + 1 − b]. Because a is a period of w[1 . . s_1] and
a, b < s_1/2, w[s_1 + 1 − b] = w[s_1 + 1 − b − a]. As b is a period of
w[1 . . s_1 + 1], w[s_1 + 1 − b − a] = w[s_1 + 1 − a]. Thus x = y, making the
value of π'[p_{i_1}] incorrect.

(b) s_1 > s_2: similarly a is a period of w[1 . . s_2 + 1] and
w[s_2 + 1] = w[s_2 + 1 − a]. Because b is a period of w[1 . . s_2] and
a, b < s_2/2 it holds that w[s_2 + 1 − a] = w[s_2 + 1 − a − b]. As a is a
period of w[1 . . s_2 + 1], w[s_2 + 1 − a − b] = w[s_2 + 1 − b]. So x' = y',
making the value of π'[p_{i_2}] incorrect. □
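The bound of Lemma 4 can be observed on concrete words. The following sketch is an experiment, not part of the proof: it computes π' from π, slides a window of 2^k indices over positions 1 . . n − 1 (the last position, where π'[n] = π[n] by definition, is left out as an assumption of this sketch) and records the largest number of distinct values from [2^k, 2^(k+1)) seen in any window. All names are illustrative.

```python
import random


def strong_failure(w):
    """1-indexed strong failure function pi'; index 0 holds the sentinel -1."""
    n = len(w)
    pi = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and w[k] != w[i - 1]:
            k = pi[k]
        if w[k] == w[i - 1]:
            k += 1
        pi[i] = k
    sp = [-1] * (n + 1)
    for i in range(1, n):
        sp[i] = sp[pi[i]] if pi[i + 1] == pi[i] + 1 else pi[i]
    if n >= 1:
        sp[n] = pi[n]
    return sp


def max_distinct_large_values(w):
    """Largest number of distinct pi' values from [2^k, 2^(k+1)) in a window of 2^k indices."""
    sp = strong_failure(w)
    n = len(w)
    worst = 0
    k = 0
    while (1 << k) < n:
        lo, hi, width = 1 << k, 1 << (k + 1), 1 << k
        for start in range(1, n - width + 1):       # window start .. start + width - 1 <= n - 1
            seen = {sp[i] for i in range(start, start + width) if lo <= sp[i] < hi}
            worst = max(worst, len(seen))
        k += 1
    return worst


rng = random.Random(0)
sample = "".join(rng.choice("ab") for _ in range(300))
observed = max_distinct_large_values(sample)
```

The quadratic scan is deliberate: it keeps the check obviously faithful to the statement, at the price of being usable only on short words.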

This observation on the combinatorial property of π' allows us to state the
promised algorithm with constant delay and O(n log log n) bits of memory usage.



Organisation of memory. For each position x we would like to store: f'[x],
w[x], the list of all its ancestors and a bit vector of flags denoting those of
the ancestors that were valid candidates for π[x], see Fig. 5. This requires
O(b^2(x)) bits, where b(x) = log f'[x], which is too much for us. Observe that
all this information, except w[x], depends solely on f'[x]: it contains the
list of ancestors in T' and the list of valid candidates for π[x], which
depends only on the list of candidates for π[f'[x]], by (6). Thus for each
position x we store w[x] and b(x) instead of f'[x]. For technical reasons we
also store a k such that A[x] = f'^(k)[x] and a flag denoting whether A[x] = 0.
Since the alphabet size is O(log n) [12] and the number of ancestors in T' is
logarithmic (by Lemma 3), the memory usage is O(n log log n) bits.



Fig. 5: Information needed for a single value of f'[x].

We now estimate the size of the space needed for a single value of f'[x]. By
Lemma 3, f'^(2)[x] < x/2. Hence the binary encodings of at most 2 ancestors of
x in T' have length exactly k. We reserve exactly two chunks of k bits for each
possible candidate of length k = 1, . . . , b(x). Hence the total number of
bits associated with a single position f'[x] is O(b^2(x)).

The bit vector is the Bcand bit vector of Validate-π-RAM with the sole
exception: we keep Bcand'[x] = Bcand[x] \ {0, A[x]} instead of Bcand[x]. It is
easy to see that it satisfies the relation:

\[
Bcand'[x] =
\begin{cases}
\{x\} \cup \bigl(Bcand'[f'[x]] \setminus \{A[x]\}\bigr) & \text{if } A[x] \neq 0,\\
\{x\} \cup Bcand'[f'[x]] & \text{if } A[x] = 0.
\end{cases}
\tag{8}
\]

We use the table Bcand'[f'[x]] in order to check whether A[x] is a valid
candidate for π[x]. Clearly 0 is a valid candidate and A[f'[x]] is not.

For each possible value of α = b(x), we allocate 48 · n/2^α blocks of O(α^2)
bits. These are called α-blocks. Each group of 48 consecutive α-blocks
corresponds to a segment [ℓ2^α, (ℓ + 1)2^α) of the input. A single α-block
consists of the value of f' and the information described in the previous
paragraph. The amount of allocated memory is O(Σ_α (n/2^α) α^2) = O(n) bits.
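The linear bound on the allocated memory can be spelled out. The derivation below is a sketch under stated assumptions: for every value α of b(x) there are 48 · n/2^α blocks, each of at most c · α^2 bits, where c is a name introduced here for the constant hidden in the O(α^2) block size.

```latex
\[
\sum_{\alpha \ge 1} 48 \cdot \frac{n}{2^{\alpha}} \cdot c\,\alpha^{2}
  \;=\; 48\,c\,n \sum_{\alpha \ge 1} \frac{\alpha^{2}}{2^{\alpha}}
  \;=\; 48\,c\,n \cdot 6
  \;=\; 288\,c\,n ,
\]
\[
\text{since}\quad
\sum_{\alpha \ge 1} \alpha^{2} x^{\alpha} = \frac{x\,(1+x)}{(1-x)^{3}}
\quad\text{evaluates to } 6 \text{ at } x = \tfrac12 .
\]
```

Only values of α up to about log n actually occur, so the finite sum is bounded by the infinite series and the total is O(n) bits.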

Validating input. Consider x such that b(x) = α and x ∈ [ℓ2^α, (ℓ + 1)2^α). We
want to check if A[x] is a valid candidate for π[x]: the values A[x − 1] + 1
and 0 are always valid; otherwise A[x] has to be an ancestor of x in T', by
Lemma 2.

To retrieve the information associated with position x, we first calculate
α = b(x) in O(1) time. This gives the offset in memory where the α-blocks for
the segment [ℓ2^α, (ℓ + 1)2^α) are stored. Then we look up the α-block number
of x, which allows accessing information on the ancestors of x. We check if
A[x] is among them. There are two positions in the block on which A[x] may be
stored; their distances from the beginning of the record can be calculated
using a constant number of arithmetic operations. If A[x] is one of the
ancestors, we check if it is a valid candidate for π[x] by a look-up in the bit
vector. Clearly it takes only O(1) time.

Then we store the values f[x + 1] and f'[x + 1], which are needed for x + 1.
The latter value can be calculated easily using (3).

Update. Suppose we add a new position x. Then we search the corresponding
range of 48 α-blocks to see if f'[x] is already stored. If f'[x] is not
present, we reserve the next unoccupied block for f'[x]. The list of ancestors
can be recreated from the list of ancestors of f'^(2)[x]. The list of valid
candidates is the list of valid candidates for f'^(2)[x] with the flag for
A[f'[x]] set to 0, according to (8). This can be done in constant time, as for
f'[x] we store the number k such that f'^(k)[f'[x]] = A[f'[x]]. Then we set the
flag denoting whether A[x] = 0, and calculate k such that A[x] = f'^(k)[x].
Both operations can be done in constant time.

Running time; lazy copying. We cannot copy eagerly, as there might be as many
as log n machine words to be copied. We use a lazy approach instead, keeping a
list of memory regions (each possibly consisting of many machine words) that
should be copied. After processing each index, we copy a constant number of
words from this list.

Assume that αb(x) words need to be copied for a single index x, and after
processing each position β words from the list are copied. Whenever there are
many elements on the list, we choose the one corresponding to the smallest
value of b(x). For that we keep a separate list for each possible value of b(x)
and a bit vector indicating which lists are non-empty. To make this work, we
need to ensure that information on x is eventually copied, but we cannot start
copying it too early, as it might block copying other information which we will
need much sooner. Therefore we start copying the words associated with x such
that γ = b(x) and x ∈ [ℓ · 2^γ, (ℓ + 1) · 2^γ) after processing (ℓ + 1) · 2^γ.
We later show that we finish copying all those words before processing
(ℓ + 2) · 2^γ.

Ensuring that information on x is copied, but not too early, is a little
tricky. For each possible value of b(x) we keep another list of waiting memory
regions, a bit vector indicating which of those lists are non-empty, and a bit
vector indicating which of those lists should be merged with the corresponding
lists of regions ready to be copied.

After processing an index divisible by 2^b(x), we would like to move all
waiting memory regions from the corresponding lists to the lists of regions
that should be copied. Although moving one list takes only constant time if we
use a simple singly-linked implementation, there can be many lists to take care
of. Hence we just mark those non-empty lists that should be moved. Then
whenever we need to extract an element with the smallest value of b(x), we look
at both the lists of elements ready to be copied and the waiting lists. If a
waiting list is marked as one to be copied, we move all its elements before
adding a new one.

We know that the total amount of information to be copied is linear (in terms 
of code words), so we have enough time do it. Our concern is that we must bound 
the delay between processing an index and copying all its associated information: 

Lemma 5. All information associated with x such that 7 = b(x) and x G 
[£2 7 , (£ + 1)2 7 ) is successfully copied before processing {£ + 2)2 7 . 

Proof. We know that the total number of machine words to be copied is linear, 
so there is enough time do it. The delay is our only concern. 

We call the memory chunks for $x$'s such that $b(x) = \gamma$ the $\gamma$-chunks. Suppose
that there are several procedures responsible for copying memory chunks:
procedure Copy-$\gamma$ is responsible for copying $\gamma$-chunks. After processing each position
we copy $\alpha\beta$ memory words, where $\beta$ is an appropriate constant which we will
calculate later, and there are $\alpha\gamma$ machine words to be copied for $x$. Imagine we
are given $\alpha\beta$ credit after each position; note though that this is a worst-case
analysis, not an amortised one.

We run the procedure Copy-$\gamma$, where $\gamma$ is the smallest value such that some
$\gamma$-chunks are to be copied. If Copy-$\gamma$ did not use all the credit, we repeat the
process (for larger $\gamma$) until we run out of it.

Consider the procedure Copy-$\gamma$ and the interval $[\ell 2^\gamma, (\ell+1) 2^\gamma)$. All the
information to be copied while this interval is read is ready before we process
position $\ell 2^\gamma$. Since Copy-$\gamma$ can use all the credit for this interval that was not
used by Copy-$(\gamma-1)$, ..., Copy-$0$, we subtract an upper bound on the credit
used by them from the $\alpha\beta 2^\gamma$ credit we are given for processing this interval.

Let $c_\gamma$ be the maximal credit used by Copy-$0$, ..., Copy-$\gamma$ on an interval
of length $2^\gamma$ assuming they do not run out of credit. Then $\alpha\beta 2^\gamma - 2c_{\gamma-1}$ is the
credit available to Copy-$\gamma$ on this interval: the credit used by Copy-$0$, ...,
Copy-$(\gamma-1)$ on an interval of length $2^\gamma$ consists of the credit used by them on two
intervals of length $2^{\gamma-1}$. On the other hand the credit released is $\alpha\beta 2^\gamma$.

We give a recursive formula for $c_\gamma$. We can assume that $c_0 = 1$. By
Lemma 4, Copy-$\gamma$ copies at most 48 records, each of them consisting of at most
$\alpha\gamma$ machine words. Let us upper bound the credit used by Copy-$0$, ...,
Copy-$(\gamma-1)$ on an interval of length $2^\gamma$. Divide this interval into two sub-intervals of
length $2^{\gamma-1}$. Then by definition on each of them Copy-$0$, ..., Copy-$(\gamma-1)$ used at
most $c_{\gamma-1}$ of credit and so

$c_\gamma = 2c_{\gamma-1} + 48\alpha\gamma$.

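By standard techniques, $c_\gamma = O(2^\gamma)$. Let $\beta$ be such that $c_\gamma \le \beta 2^\gamma$. We now
show by induction on $\gamma$ that $48\alpha\gamma \le \alpha\beta 2^\gamma - 2c_{\gamma-1}$, i.e. the credit available to
Copy-$\gamma$ is enough to pay for the copying it should perform. It trivially holds
for $\gamma = 0$. Now consider $\gamma + 1$:

$\alpha\beta 2^{\gamma+1} - 2c_\gamma = \alpha\beta 2^{\gamma+1} - c_{\gamma+1} + 48\alpha(\gamma+1) \ge 48\alpha(\gamma+1)$.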

So there is enough time to copy all the required information. □ 
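
The recurrence and the induction step can be checked numerically. The Python sketch below assumes $c_0 = 1$ and an integer $\alpha \ge 1$, and tries the constant $\beta = 1 + 96\alpha$; this value is chosen for the illustration only (it follows from $\sum_{k \ge 1} k/2^k = 2$) and is not taken from the proof. The function names are hypothetical.

```python
def copy_credit_bounds(alpha, max_gamma):
    """Return [c_0, ..., c_max_gamma] for c_g = 2*c_{g-1} + 48*alpha*g with c_0 = 1."""
    c = [1]
    for g in range(1, max_gamma + 1):
        c.append(2 * c[-1] + 48 * alpha * g)
    return c


def credit_suffices(alpha, beta, max_gamma):
    """Check, for 1 <= g <= max_gamma, that c_g <= beta * 2^g and that
    48*alpha*g <= alpha*beta*2^g - 2*c_{g-1} (the credit left for Copy-g)."""
    c = copy_credit_bounds(alpha, max_gamma)
    for g in range(1, max_gamma + 1):
        if c[g] > beta * 2 ** g:
            return False
        if 48 * alpha * g > alpha * beta * 2 ** g - 2 * c[g - 1]:
            return False
    return True
```

For example, `credit_suffices(1, 97, 40)` holds, while a constant that is too small, such as `beta = 10`, already fails at the first level.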

As a consequence, each list contains at most 48 elements, so the total size of
the memory required to perform lazy copying is just $O(\log^2 n)$.

To make use of this lazy copying, we must take care of a few details.
Just before we put a pointer to the block of memory corresponding to $x$ on the
list of chunks that should be copied, we copy the value $f'[x]$ and set a special
flag meaning that its contents are still being processed. When we want to extract some
information about $x$ from the memory, it might happen that the information
is not copied yet. In such a case we look at the block of memory corresponding
to $f'[x]$. If it is also not ready yet, we look at $f'^{(2)}[x]$, and so on. Lemma 3
guarantees that f'( 5 )[x] < < |. By Lemma 5, the information associated 
with f'( 5 ) [x] is copied after processing at most § indices, well before considering 
x. Thus a constant number of lookups is enough to get to an ancestor who stores 
a complete list of his own ancestors. This gives an algorithm both time optimal 
and using a sublinear amount of machine words: 

Theorem 3. A $\pi$ array of length $n$ can be validated online in real time using
(^(n 10 ^ 1 "^" ) — o(n) machine words of (log n) bits. 

4 Online strict border array validation 

While we already know a simple algorithm Compute-$\pi$-From-$\pi'$, this algorithm
is not online. Therefore we need another approach. In general, we assume that
$A'$ is a valid $\pi'_w$ for some word $w$, recover $\pi_w$ out of $A'$ and then run
Validate-$\pi$ on $\pi_w$ to calculate the word $w$ and the minimal size of the required
alphabet. In the following we present algorithm Validate-$\pi'$ that performs this task.

Fig. 6: Maximal consistent function.

Overview of the algorithm. Imagine the array $A'$ as the set of points $(i, A'[i])$.
We say that a table $A[1..n+1]$ is consistent with $A'[1..n]$ iff the following two
conditions hold:

(i1) $A[1..n+1] = \pi_w[1..n+1]$ for some word $w[1..n+1]$;

(i2) $\pi'_w[1..n] = A'[1..n]$.

The algorithm keeps a maximal function $A[1..n+1]$ consistent with $A'[1..n]$,
i.e. the one satisfying the condition:

(i3) for every $B[1..n+1]$ such that $B$ is consistent with $A'[1..n]$ it holds that
$A[j] \ge B[j]$ for $j = 1, \ldots, n+1$,

see Fig. 6. Note that after reading $n$ input symbols we recover $A[1..n+1]$. To
express shortly that $A$ is a maximal candidate we use a notation $A[1..m] \ge B[1..m]$
to denote that $A[j] \ge B[j]$ for $j = 1, \ldots, m$. The invariant of the
algorithm is that the computed function $A$ satisfies conditions (i1)-(i3).

We think of $A$ as a collection of maximal slopes: a set of indices $i, i+1, \ldots, i+j$
is a slope if $A[i+k] = A[i] + k$ for $k = 1, \ldots, j$. Note that, by (1),
$A[i+j+1] \ne A[i+j] + 1$ implies that $A'[i+j] = A[i+j]$. When a new letter is read, we have
to update $A$ or claim that $A'$ is invalid. It turns out that only the last slope has
to be updated.
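
As an illustration of the slope decomposition, the following Python sketch splits an array into its maximal slopes. The function name `maximal_slopes` is hypothetical; the array is passed as an ordinary 0-indexed Python list, while the returned positions are 1-indexed to match the text.

```python
def maximal_slopes(A):
    """Split A[1..m] (given as a 0-indexed list) into maximal slopes.

    Returns a list of 1-indexed inclusive pairs (first, last); inside each pair
    consecutive values increase by exactly one.
    """
    slopes = []
    start = 1
    for pos in range(2, len(A) + 1):
        # A[pos] != A[pos - 1] + 1 in the 1-indexed notation of the text
        if A[pos - 1] != A[pos - 2] + 1:
            slopes.append((start, pos - 1))
            start = pos
    if A:
        slopes.append((start, len(A)))
    return slopes
```

For the border array 0, 1, 0, 1, 2, 3 of the word aabaab this yields the slopes (1, 2) and (3, 6).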

It can be shown that (i1)-(i3) imply a stronger property, which is essential
for our analysis.

Lemma 6. Let $A[1..n+1]$ be a maximal function consistent with $A'[1..n]$ and
$B[1..n+1]$ be consistent with $A'[1..n]$. Let $i$ be the first position of the last
slope of $A$. Then $A[1..i-1] = B[1..i-1]$.

Proof. If there is only one slope, there is nothing to prove. If there are more,
consider $i-1$, the last element of the second to last slope. Since this
is the end of a slope, by (1) $A'[i-1] = A[i-1]$. On the other hand,
consider $B[1..n+1]$ as in the statement of the lemma. By (i3) it holds that
$A[i-1] \ge B[i-1]$. Thus

$A'[i-1] \le B[i-1] \le A[i-1] = A'[i-1]$,

hence $B[i-1] = A[i-1]$. Let $B[1..n+1] = \pi_{w'}[1..n+1]$. Using
Compute-$\pi$-From-$\pi'$ we can uniquely recover $\pi_{w'}[1..i-1]$ from $\pi'_{w'}[1..i-1]$ and $\pi_{w'}[i-1]$.
But those values are the same for $A[1..i-1]$, thus $A[1..i-1] = \pi_{w'}[1..i-1]$. □

When a new value $A'[n]$ is read, it may happen that $A[1..n]$ does not satisfy
(i2): by (1)-(2), $A'[n] = A[n]$ or $A'[n] = A'[A[n]]$ holds if $A[1..n+1]$ is
consistent with $A'[1..n]$. Then we adjust the values of $A$ on the last slope.
Suppose that there is some other valid candidate function $A_1[1..n+1]$. By
invariant (i3), $A[1..n] \ge A_1[1..n]$, i.e. $A[j] \ge A_1[j]$ for $j \in [1..n]$. In order
to replace $A$ by some valid candidate we have to decrease it at some point, see
Fig. 7. Let $i$ denote the beginning of the last slope of $A$. By Lemma 6,
$A[1..i-1] = A_1[1..i-1]$. So in each step we decrease the value of $A[i]$ by the
smallest offset, so that $A[1..n] \ge A_1[1..n]$ is kept. It can happen that at some
index $j$ it holds that $A'[j] > A[j]$. Then $A_1[j] < A'[j]$ and as $A_1$ is chosen
arbitrarily, $A'$ is invalid. On the other hand it may turn out that $A'[j] = A[j]$.
In such a case $A_1[j] = A[j]$ and we shorten the last slope: by (1), $A'[j] = A[j]$
implies $A[j+1] < A[j] + 1$.

Fig. 7: Candidate decreasing.

Information stored. The algorithm stores:

- the input read so far, i.e. a prefix of $A'$,

- a suffix tree for $A'$, created online [11, 15],

- the number $n$ of values of $A'$ read so far,

- the first position $i$ on the last slope,

- the table $A[1..i-1]$, whose values are fixed,

- the candidate value $A[i]$, which may be changed later.

The algorithm also uses implicit values $A[i+j] = A[i] + j$ for $j = 1, \ldots, n-i+1$.
These values do not need to be stored in the memory.

Validate-$\pi$ is run for $A[1..i-1]$ (or $A[1..i]$, if $A[i] = 0$), i.e. on the values of
$A$ not changed later. Since by (i1) $A$ is a valid border array, Validate-$\pi$ cannot
call an error. It is run in order to calculate the minimal size of the alphabet,
the letters of the word $w$ and the sets of valid candidates for $\pi$ at the respective
positions. Note that even the valid candidates for $n+1$ can be calculated, as by (6)
the set of candidates for a position $i$ is expressed in terms of the candidates for
$f[i] < i$. Moreover, since $A$ is fixed for $f[i]$, the set of candidates for $i$ is
calculated just once. Since Validate-$\pi$ is an online algorithm, we may feed it with
values of $A$ as soon as they are stored.

Update and adjusting the last slope. When the next value $A'[n]$ is read we
check whether $A'[n] = A'[A[n]]$. If not, either $A$ has ceased to be a valid border
table or the last slope is defined improperly. Hence we adjust $A$ on the current
last slope. When adjusting the last slope we aim at satisfying two conditions

$A'[j] < A[j]$ and $A'[j] = A'[A[j]]$,   (9)

for each $j \in [i..n]$. These conditions are checked by two queries: the height-query
returns the smallest $j \in [i..n]$ such that $A'[j] \ge A[j]$; the value-query answers
whether $A'[i..n] = A'[A[i]..A[i]+(n-i)]$. Note that the second query is just
a short way of asking whether for $j \in [i..n]$ the second condition of (9) holds,
as $A[j] = A[i] + (j-i)$.

We ask both queries until the height-query returns no index and the value-query
returns yes. If the height-query returns an index $j$ such that $A'[j] > A[j]$, then
we reject the input and call an error. If $A'[j] = A[j]$ then we check (naively)
whether $A'[i..j-1] = A'[A[i]..A[i]+(j-i-1)]$. If not, we reject. If it holds
then a new slope $[i..j]$ and a new last slope $[j+1..n]$ are created, we store
values $A[i..j]$, $i$ is set to $j+1$ and we set $A[i]$ to the largest possible candidate
value for $\pi[i]$. Then we continue adjusting.

If the value-query answers no then we set the value of $A[i]$ to the next largest
valid candidate value for $\pi[i]$ and continue with adjusting. If $A[i] = 0$ then there
is no such candidate value and we reject.
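
The two queries can be stated as naive reference implementations, useful mainly for testing faster versions. The Python sketch below is an illustration only: the function names are hypothetical, the table $A'$ is stored 1-indexed in a list whose entry 0 holds a sentinel $-1$ (this assumes the convention $\pi'[0] = -1$, which is not restated in this section), and only the value $A[i]$ is passed because the remaining values on the last slope are implicit.

```python
def height_query(Ap, A_i, i, n):
    """Smallest j in [i..n] with A'[j] >= A[j], where A[j] = A[i] + (j - i).

    Returns None when no such j exists. Runs in O(n - i) time.
    """
    for j in range(i, n + 1):
        if Ap[j] >= A_i + (j - i):
            return j
    return None


def value_query(Ap, A_i, i, n):
    """Whether A'[i..n] equals A'[A[i] .. A[i] + (n - i)], compared naively."""
    return Ap[i:n + 1] == Ap[A_i:A_i + (n - i) + 1]
```

For instance, with `Ap = [-1, -1, 1, -1, -1, 1]` (the strict border values of aabaab at positions 1 to 5, preceded by the sentinel), `height_query(Ap, 0, 1, 2)` returns 2 with $A'[2] = A[2]$, so position 2 ends a slope, and `value_query(Ap, 0, 3, 5)` holds for the last slope starting at $i = 3$ with $A[3] = 0$.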

Lemma 7. After reading a valid strong prefix array $A'[1..n]$, Validate-$\pi'$
satisfies conditions (i1)-(i3). Otherwise Validate-$\pi'$ raises an error.

Proof. We proceed by induction on $n$. If $n = 0$, then clearly $A[1] = 0$, all the
invariants trivially hold and $A'$ is a valid $\pi'$ array.

Whenever a new symbol is read, Validate-$\pi'$ checks the second condition
of (9) for $j = n$. If it holds, then no changes are needed because:

Condition (i1) holds trivially: we implicitly set $A[n+1] = A[n] + 1$, which
is always a valid value for $\pi[n+1]$.

Condition (i2) holds: since $A'[n] = A'[A[n]] < A[n]$, by (2) $A[n]$ is properly
defined.

Condition (i3) holds: consider any $B[1..n+1]$ consistent with $A'[1..n]$.
By the induction assumption (i3) holds for $A[1..n]$, hence $B[n] \le A[n]$. Therefore
$B[n+1] \le B[n] + 1 \le A[n] + 1 = A[n+1]$. Thus $A$ is still maximal.

Suppose that the second condition of (9) is not satisfied for $j = n$. Then
the algorithm starts adjusting $A$. Consider first the case when there is a function
consistent with $A'[1..n]$. We show that during the adjusting (i3) holds, i.e.
for every $A_1[1..n+1]$ consistent with $A'[1..n]$ it holds that $A_1[1..n+1] \le A[1..n+1]$,
even though $A[1..n+1]$ may not be consistent with $A'[1..n]$.
Moreover, during the adjustments (i1) is preserved. In the end we show that
when the adjustments stop then (i2) is satisfied.

So let $A_1$ be as described earlier. By Lemma 6, $A[1..i-1] = A_1[1..i-1]$. The
algorithm repeatedly checks conditions (9) using the height-query and the value-query.
Suppose that the height-query returns $j$ such that $A[j] < A'[j]$. Then, since (i3)
is satisfied, $A_1[j] \le A[j] < A'[j]$, i.e. $A_1$ is not a valid $\pi$ table, a contradiction.
So suppose that the height-query returned $j$ such that $A[j] = A'[j]$. Then similarly,
by (i3), $A_1[j] \le A[j] = A'[j]$, but as $A_1$ is a valid $\pi$ table, it also holds that
$A_1[j] \ge A'[j]$. So $A[j] = A'[j] = A_1[j]$, and by (1) $j$ is an end of a slope for
$A_1$. Since $A_1$ is chosen arbitrarily, this holds for any table $B$ consistent with $A'$.
Then, by Lemma 6, $A[1..j] = A_1[1..j]$. Since for $j' \in [i..j-1]$ it holds that
$A[j'] > A'[j']$, by (2) $A[j']$ and $A'[j']$ must satisfy the equation $A'[j'] = A'[A[j']]$.
If this equation is not satisfied by some $j'$ then clearly we reject the input, as
$A_1$ is invalid as well. Then the algorithm sets $i$ to $j+1$ and sets $A[i]$ to the
largest possible candidate value for $\pi[i]$ smaller than $A[i-1] + 1$. Note that
$A_1[i] < A[i-1] + 1$ by (1). The implicit values $A[j]$ for $j \in [i+1..n]$ satisfy
$A[j] = A[i] + (j-i)$. Thus (i1) is still satisfied: all the changed values of $A$
are valid $\pi$ candidate values at the respective positions. Since $A_1$ is a valid $\pi$
table,

$A_1[j] \le A_1[i] + (j-i) \le A[i] + (j-i) = A[j]$,

i.e., $A$ still satisfies (i3).

If for all $j \in [i..n]$ it holds that $A[j] > A'[j]$, then the algorithm asks the
value-query. Suppose it is not satisfied for some $j$, i.e. $A'[j] \ne A'[A[j]]$. Hence
$A_1[j] \ne A[j]$, as by (1)-(2) either $A_1[j] = A'[j] < A[j]$ or $A'[j] = A'[A_1[j]]$. We
show that $A_1[i] \ne A[i]$: assume otherwise and consider the smallest $j'$ such
that $A_1[j'+1] \ne A[j'+1]$. Since (i3) holds,

$A_1[j'+1] < A[j'+1] = A[j'] + 1 = A_1[j'] + 1$.

Hence $A_1[j'+1] < A_1[j'] + 1$. By (1), $A'[j'] = A_1[j'] = A[j']$, a contradiction. On
the other hand, Validate-$\pi'$ sets $A[i]$ to the next largest valid candidate for $\pi[i]$.
So $A[i] \ge A_1[i]$ still holds. Since the implicit values satisfy $A[j] = A[i] + (j-i)$
for $j \in [i+1..n]$, then also $A_1[j] \le A_1[i] + (j-i) \le A[i] + (j-i) = A[j]$. So (i3)
still holds for $A$. Note again that $A[i..n+1]$ were all assigned valid candidates
for $\pi$ at their respective positions, so (i1) still holds.

Now we show that when both conditions (9) are satisfied (i.e. when all the
adjustments are finished) invariant (i2) is satisfied as well. Note that (1) and (2)
give the following formulas for $\pi'$ in terms of $\pi$:

$$\pi'[j] = \begin{cases} \pi[j] & \text{if } j \text{ is the last element of some slope,} \\ \pi'[\pi[j]] & \text{in other cases.} \end{cases}$$

Those formulas are verified for $j$ when $A[j]$ is stored. For $j$ such that $A[j]$ are
implicit values, i.e. such that $j$ is on the last slope of $A$, those are verified by
the value-query. Hence when all adjustments are finished, (i2) holds.
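
The two-case formula can be cross-checked on small words. The Python sketch below is an illustration under assumptions: it uses the standard definitions of the border array $\pi$ and of the strict border array $\pi'$ (the longest border of $w[1..j]$ whose next letter differs from $w[j+1]$), with the convention $\pi'[0] = -1$ when no such border exists; neither definition is restated in this section, and the function names are hypothetical.

```python
def border_array(w):
    """pi[0..len(w)] with pi[j] = length of the longest proper border of w[1..j]."""
    n = len(w)
    pi = [0] * (n + 1)
    k = 0
    for j in range(2, n + 1):
        while k > 0 and w[k] != w[j - 1]:
            k = pi[k]
        if w[k] == w[j - 1]:
            k += 1
        pi[j] = k
    return pi


def strict_from_slopes(pi):
    """pi'[0..n] obtained from pi[0..n+1] by the two-case slope formula."""
    n = len(pi) - 2
    sp = [-1] * (n + 1)          # sp[0] = -1 is the assumed convention
    for j in range(1, n + 1):
        if pi[j + 1] != pi[j] + 1:   # j is the last element of a slope
            sp[j] = pi[j]
        else:
            sp[j] = sp[pi[j]]
    return sp


def strict_brute(w):
    """pi'[0..len(w)-1] computed directly from the definition, by brute force."""
    n = len(w) - 1
    sp = [-1] * (n + 1)
    for j in range(1, n + 1):
        for k in range(j - 1, -1, -1):
            # k is a border length of w[1..j] and the following letters differ
            if w[:k] == w[j - k:j] and w[k] != w[j]:
                sp[j] = k
                break
    return sp
```

For the word aabaab, `border_array` gives 0, 1, 0, 1, 2, 3 at positions 1 to 6, and both `strict_from_slopes` and `strict_brute` give $-1, 1, -1, -1, 1$ at positions 1 to 5.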

Suppose now that the input is not a valid strong prefix table. We show the
contrapositive: if Validate-$\pi'$ accepts the input then $A'$ is valid. Since (i1) was
preserved during the adjustments, $A$ is a valid $\pi$ table. Moreover, for each
position $j$ conditions (9) are satisfied, as the adjustments of the slopes end only
when they are satisfied for each position. So $A'$ is a valid candidate for $\pi'_w$
such that $\pi_w = A$. □

Theorem 4. Validate-$\pi'$ correctly computes $\pi_w$ such that $A' = \pi'_w$ and
calculates the required minimum size of the alphabet.

Proof. By Lemma 7, Validate-$\pi'$ raises an error if the input table is invalid and
otherwise returns the size of the alphabet. So we need to show that the size of
the alphabet is computed correctly.

By invariant (i1), $A[1..n+1] = \pi_w[1..n+1]$ for some word $w$ and by
invariant (i2) $\pi'_w = A'$. Word $w[1..i-1]$ is in fact created by Validate-$\pi$.
So two questions remain: is the alphabet required for $w$ minimal for $A'$, and is
the answer given by Validate-$\pi$ really the alphabet needed for $w[1..n]$ (as
Validate-$\pi$ is run on the prefix $A[1..i-1]$ only)?

Suppose first that $A[i] > 0$. Then Validate-$\pi$ run on $A[1..n]$ returns the
same size of required alphabet as run on $A[1..i-1]$, as new letters are needed
only when $A[j] = 0$ at some position, and $A[j] > 0$ for $j$ on the last slope. So
consider any $B[1..n+1]$ consistent with $A'[1..n]$. Then by Lemma 6,
$B[1..i-1] = A[1..i-1]$. Clearly Validate-$\pi$ run on $B[1..i-1]$ reports a required
alphabet size not larger than Validate-$\pi$ run on $B[1..n]$. So indeed $A$ does not
require a larger alphabet than $B$.

Suppose now that $A[i] = 0$. Then for any $B[1..n+1]$ consistent with
$A'[1..n]$, by (i3) it holds that $0 \le B[i] \le A[i] = 0$. Since $A[j] > 0$ for $j > i$, the
same argument as previously works. □



Answering queries. Consider first the height-query. The idea is that if
$A'[j'] - A'[j] > j' - j > 0$ then $j$ cannot be an answer to the height-query. Using this
observation a list of possible answers can be kept and quickly updated.

Lemma 8. Answering all height-queries can be done in amortised linear time.




Proof. Consult Fig. 8. The idea is as follows: consider any two indices $j, j'$ such
that

$A'[j'] - A'[j] > j' - j > 0$.

We denote this relation by $j \prec j'$ and say that $j'$ dominates $j$. Then $j$ cannot
be an end of any slope: if $A[j] = A'[j]$ then

$A[j'] \le A[j] + (j'-j) = A'[j] + (j'-j) < A'[j] + A'[j'] - A'[j] = A'[j']$,

a contradiction. Note that $j \prec j'$ and $j' \prec j''$ imply $j \prec j''$: clearly $j < j' < j''$,
and

$A'[j'] - A'[j] > j' - j$ and $A'[j''] - A'[j'] > j'' - j'$

summed up imply that

$A'[j''] - A'[j] > j'' - j$.

This can be reformulated in terms of height-queries: if $j < j'$ and
$A'[j'] - A'[j] > j' - j$ then $A'[j] \ge A[j]$ implies $A'[j'] > A[j']$, i.e. that the instance is
invalid. Hence we need not keep track of $j$ as a potential answer to the
height-query. It is enough to keep a list of positions $j_1 < j_2 < \cdots < j_k$ such that
$j_i \not\prec j_\ell$ for all $i, \ell$ and $j_\ell$ dominates all $j \in [j_{\ell-1}+1 .. j_\ell-1]$.

When the query is asked we check if $A[j_1] \le A'[j_1]$. We show that evaluating
this expression for other values of $j$ is not needed. Suppose that $A'[j] \ge A[j]$ for
some $j \in [j_{\ell-1}+1 .. j_\ell-1]$. Then since $j_\ell$ dominates $j$ it holds that $A'[j_\ell] > A[j_\ell]$.
Suppose now that $A'[j_\ell] \ge A[j_\ell]$ for $j_\ell > j_1$. Then since $j_1 < j_\ell$ and $j_\ell$ does not
dominate $j_1$ it holds that $A'[j_\ell] - A'[j_1] \le j_\ell - j_1$. As $j_1$ and $j_\ell$ are on the last
slope, $A[j_\ell] = A[j_1] + (j_\ell - j_1)$, hence

$A[j_1] = A[j_\ell] - (j_\ell - j_1) \le A[j_\ell] - (A'[j_\ell] - A'[j_1]) \le A'[j_1]$,

so $j_1$ is a proper answer to the height-query. So the height-query is answered in
constant time.

We demonstrate that all updates of the list $j_1, \ldots, j_k$ can be done in $O(n)$
time. When a new position $n$ is read, we update the list by successively removing
the $j_\ell$'s dominated by $n$ from the end of the list. By routine calculations, if $n$
dominates $j_\ell$, then it dominates $j_{\ell+1}$:

$A'[j_{\ell+1}] - A'[j_\ell] \le j_{\ell+1} - j_\ell$,
$A'[n] - A'[j_\ell] > n - j_\ell$

imply

$A'[n] - A'[j_{\ell+1}] > n - j_{\ell+1}$.

So we have to remove some suffix of the kept list of $j$'s.

Suppose that $j_\ell, \ldots, j_k$ were removed. Then $j_\ell, \ldots, j_k \prec n$, so $j \prec n$ for each
$j \in [j_{\ell-1}+1 .. j_k-1]$. Moreover $j_{\ell-1} \not\prec n$ and thus also $j \not\prec n$ for $j = j_1, \ldots, j_{\ell-1}$.

As each position enters and leaves the list at most once, the time of update
is linear. □
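
The list update in the proof behaves like a monotone stack. The Python sketch below illustrates it under the assumption that domination is the strict relation $A'[j'] - A'[j] > j' - j$; the function names are hypothetical and $A'$ is stored 1-indexed with an unused entry at index 0.

```python
def update_candidates(stack, Ap, n):
    """Append position n after removing every trailing position dominated by n.

    n dominates j (for j < n) when A'[n] - A'[j] > n - j.
    """
    while stack and Ap[n] - Ap[stack[-1]] > n - stack[-1]:
        stack.pop()
    stack.append(n)


def undominated_positions(Ap, n_max):
    """Positions of [1..n_max] that survive after reading A'[1..n_max] one by one."""
    stack = []
    for n in range(1, n_max + 1):
        update_candidates(stack, Ap, n)
    return stack
```

Each position is pushed once and popped at most once, which gives the linear total update time. For `Ap = [-1, -1, 1, -1, -1, 1]` the surviving positions after reading five values are 2, 3 and 5.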

To answer value-queries efficiently we construct online a suffix tree [11, 15] for
the input table $A'[1..n]$. Answering the value-queries can be done by dividing
the query into two sub-queries: one is checked naively, the other by traversing up
in the suffix tree. It is possible to amortise both sub-queries.

Lemma 9. Answering all value-queries can be done in $O(n \log n)$ time.

Proof. First of all, a suffix tree is constructed online [11, 15] for the input table
$A'[1..n]$. This takes $O(n \log n)$ time: the construction is linear in the length of
the word but logarithmic in the size of the alphabet. Since $A'$ may have values up
to $n-1$ we have to include the logarithmic factor.

Fix an index $i$ and consider all value-queries asked while $i$ was considered
as the beginning of the last slope. The set of valid candidates for $\pi[i]$ is of size
$O(\log n)$: by Lemma 2 only one candidate is not of the form $f'^{(j)}[i]$ for some $j$
and by Lemma 3 there are only $O(\log n)$ positions of such form. Hence for a
fixed position $i$ there are $O(\log n)$ value-queries asked. Suppose that the queries
were asked for candidates $A[i] - \ell_1, \ldots, A[i] - \ell_{k_i}$. We show that the query about
$A[i] - \ell_j$ can be answered in $O(\ell_j)$ time. Then, since $\ell_1 < \cdots < \ell_{k_i}$,

$O(\sum_{m=1}^{k_i} \ell_m) = O(\sum_{m=1}^{k_i} \ell_{k_i}) = O(k_i \ell_{k_i}) = O(\ell_{k_i} \log n)$.

Then we sum over all possible $i$ and show that $\sum_i \ell_{k_i} \le n$, hence the result can
be upper bounded by $O(n \log n)$.



Fig. 9: The subqueries of value-query. Solid lines represent the known equalities between
fragments of the $A'$ table, dashed lines represent the tests. The tests and equalities should
be read according to the arrow-heads.



Suppose that we check whether $A'[i..n] = A'[A[i]-\ell_j .. A[i]-\ell_j+(n-i)]$.
If $\ell_j \ge n-i$, this can be done naively in $O(n-i) = O(\ell_j)$ time. If $\ell_j < n-i$,
we divide the test into three subtests, see Fig. 9. On one hand, we check naively
whether

$A'[i..i+\ell_j-1] = A'[A[i]-\ell_j .. A[i]-1]$,

this is done in $O(\ell_j)$ time. We also naively check $A'[n] = A'[A[i]+(n-i-\ell_j)]$ in
constant time. Finally we check whether

$A'[i+\ell_j..n-1] = A'[A[i] .. A[i]+(n-i-\ell_j-1)]$.

We show how to perform this test efficiently. Since $A[1..n]$ is a valid candidate
for $A'[1..n-1]$ and $i$ was the first position on the last slope of $A$, then

$A'[i..n-1] = A'[A[i] .. A[i]+(n-i)-1]$,

by (2). Therefore

$A'[A[i] .. A[i]+(n-i-\ell_j)-1] = A'[i..n-\ell_j-1]$.

So it is enough to check whether

$A'[i+\ell_j..n-1] = A'[i..n-\ell_j-1]$,

i.e. whether $A'[i+\ell_j..n-1]$ is a prefix of $A'[i..n-1]$. This is easily done using
suffix trees: w.l.o.g. we may assume that the check is made before $A'[n]$ was added
to the suffix tree. We enrich the suffix tree: each node has a pointer to its father.
This does not increase the construction time. Then we go to the vertex corresponding to
the suffix $A'[i..n-1]$ and traverse the tree $\ell_j$ letters up. We return whether the suffix
$A'[i+\ell_j..n-1]$ ends in this node. Traversing up costs $O(\ell_j)$ time.
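
The three-way split can be illustrated in Python as follows. This is a sketch under assumptions, not the data structure of the proof: the prefix test that the proof answers with a suffix tree in $O(\ell_j)$ time is replaced here by a plain slice comparison, the function names are hypothetical, `a` stands for the current value $A[i]$, `ell` for $\ell_j$, and $A'$ is stored 1-indexed with a sentinel at index 0.

```python
def value_query_naive(Ap, i, n, cand):
    """Direct test of A'[i..n] == A'[cand .. cand + (n - i)]."""
    return Ap[i:n + 1] == Ap[cand:cand + (n - i) + 1]


def value_query_split(Ap, i, n, a, ell):
    """Test the candidate a - ell, assuming it is already known that
    A'[i..n-1] == A'[a .. a + (n - i) - 1] holds for the current value a = A[i]."""
    if ell >= n - i:
        return value_query_naive(Ap, i, n, a - ell)
    # First ell positions, compared naively in O(ell) time.
    head = Ap[i:i + ell] == Ap[a - ell:a]
    # The newest position n, compared in constant time.
    last = Ap[n] == Ap[a + (n - i - ell)]
    # Is A'[i+ell..n-1] a prefix of A'[i..n-1]? A slice comparison stands in
    # for the suffix-tree walk described in the proof.
    middle = Ap[i + ell:n] == Ap[i:n - ell]
    return head and last and middle
```

Whenever the stated precondition holds, the split test agrees with the direct comparison, because the middle part of the candidate range coincides with a prefix of $A'[i..n-1]$.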

Consider again all the queries asked at position $i$ for valid candidates for $\pi[i]$
equal to $A[i] - \ell_1 > \ldots > A[i] - \ell_{k_i}$. Then $A[i]$ was replaced by $A[i] - \ell_{k_i}$, hence
also the value of $A[n]$ was replaced by $A[n] - \ell_{k_i}$. Since $A[n]$ increases by at most
1 when $n$ increases by 1, $\sum_{i=1}^{n} \ell_{k_i}$ is linear in $n$. Therefore all value-queries are
answered in $O(n \log n)$ total time. □

Running time. Construction of the suffix tree and answering value-queries
takes $O(n \log n)$ time, answering height-queries takes $O(n)$ time. Running
Validate-$\pi$ takes $O(n)$ time. Therefore the algorithm runs in $O(n \log n)$ time.

Pseudocode of Validate-π'

Validate-π'(A')
    A[1] ← 0, i ← 1, n ← 0
    repeat
        n ← n + 1
        if A'[n] > A[n] then error A' is not valid at n
        A[n + 1] ← A[n] + 1
        repeat
            change ← false
            if there is j ∈ [i..n] such that A'[j] > A[j] then
                error A' is not valid at n
            if there is j' ∈ [i..n] such that A'[j'] = A[j'] then
                let j ← minimal such j'
                if A'[i..j - 1] ≠ A'[A[i]..A[i] + (j - i - 1)] then
                    error A' is not valid at n
                run Validate-π(A) on positions i..j
                if A[j + 1] = 0 then
                    run Validate-π(A) on position j + 1
                for m ← i + 1 to j do
                    store A[m] ← A[m - 1] + 1
                i ← j + 1
                change ← true
            if A'[i..n] ≠ A'[A[i]..A[i] + (n - i)] then change ← true
            if change then
                if A[i] = 0 then error A' is not valid at n
                A[i] ← next largest candidate value, including 0
        until change = false
    until Armageddon
Remarks. While Validate-$\pi$ produces online a word $w$ over a minimal alphabet
such that $\pi_w = A$, this is not the case with Validate-$\pi'$. At each time-step
Validate-$\pi'$ can output a word over a minimal alphabet such that $\pi'_w = A'$, but
it is not possible to do so online, as the letters assigned to positions on the last
slope can change during the run of Validate-$\pi'$.

Note that since Validate-$\pi'$ keeps the function $\pi[1..i+1]$ after reading
input $A'[1..i]$, no changes are required to adapt it to $g$ validation, where
$g(i) = \pi'[i-1] + 1$ is the function used in [6].

Open problems 

Two interesting questions remain: is there a real time algorithm for validating
$A$ as $\pi$ in the pointer machine model? Is there a linear time algorithm for
validating $A'$ as $\pi'$? The latter probably requires eliminating suffix trees from the
construction or some clever encoding of values of $A'$. We believe it can be done
with better understanding of the underlying word combinatorics.